{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Wbh466Asi-5n",
    "papermill": {
     "duration": 0.024906,
     "end_time": "2021-07-23T16:17:55.704014",
     "exception": false,
     "start_time": "2021-07-23T16:17:55.679108",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# A Case Study: The Effect of Gun Ownership on Gun-Homicide Rates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WACAA85ci-5z",
    "papermill": {
     "duration": 0.024533,
     "end_time": "2021-07-23T16:17:55.753444",
     "exception": false,
     "start_time": "2021-07-23T16:17:55.728911",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "\n",
    "We consider the problem of estimating the effect of gun ownership on the homicide rate. For this purpose, we perform inference on $\\beta$ in the following the partially linear model:\n",
    "$$\n",
    "Y_{j, t}=\\beta D_{j,(t-1)}+g\\left(X_{j, t}, \\bar{X}_j, \\bar{X}_t, X_{j, 0}, Y_{j, 0}, t\\right)+\\epsilon_{j, t}\n",
    "$$\n",
    "$Y_{j, t}$ is the log homicide rate in county $j$ at time $t. D_{j, t-1}$ is the log fraction of suicides committed with a firearm in county $j$ at time $t-1$, which we use as a proxy for gun ownership $G_{j, t}$, which is not observed. $X_{j, t}$ is a set of demographic and economic characteristics of county $j$ at time $t$. We use $\\bar{X}_j$ to denote the within county average of $X_{j, t}$ and $\\bar{X}_t$ to denote the within time period average of $X_{j, t} . X_{j, 0}$ and $Y_{j, 0}$ denote initial conditions in county $j$. We use $Z_{j, t}$ to denote the set of observed control variables $\\left\\{X_{j, t}, \\bar{X}_j, \\bar{X}_t, X_{j, 0}, Y_{j, 0}, t\\right\\}$, so that our model is\n",
    "\n",
    "$$\n",
    " Y_{i,t} = \\beta D_{i,(t-1)} + g(Z_{i,t}) + \\epsilon_{i,t}.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hOcTlYfPi-5z",
    "papermill": {
     "duration": 0.024711,
     "end_time": "2021-07-23T16:17:55.803109",
     "exception": false,
     "start_time": "2021-07-23T16:17:55.778398",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "btLKXBrUi-5z",
    "papermill": {
     "duration": 0.025115,
     "end_time": "2021-07-23T16:17:55.854426",
     "exception": false,
     "start_time": "2021-07-23T16:17:55.829311",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "$Y_{j,t}$ is the log homicide rate in county $j$ at time $t$, $D_{j, t-1}$ is the log fraction of suicides committed with a firearm in county $j$ at time $t-1$, which we use as a proxy for gun ownership,  and  $Z_{j,t}$ is a set of demographic and economic characteristics of county $j$ at time $t$. Assuming the firearm suicide rate is a good proxy for gun ownership, the parameter $\\beta$ is the effect of gun ownership on homicide rates, controlling for county-level demographic and economic characteristics.\n",
    "\n",
    "The sample covers 195 large United States counties between the years 1980 through 1999, giving us 3900 observations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "x4283h0kwo_M"
   },
   "outputs": [],
   "source": [
    "# Import relevant packages\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import re\n",
    "from sklearn.model_selection import cross_val_predict, KFold\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.ensemble import GradientBoostingRegressor\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.neural_network import MLPRegressor\n",
    "import warnings\n",
    "import statsmodels.api as sm\n",
    "import statsmodels.formula.api as smf\n",
    "warnings.simplefilter('ignore')\n",
    "\n",
    "np.random.seed(1234)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "g6JCZCdqvj1Z"
   },
   "outputs": [],
   "source": [
    "file = \"https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/gun_clean.csv\"\n",
    "data = pd.read_csv(file)\n",
    "data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TkxefAQ7u-SD",
    "papermill": {
     "duration": 0.025977,
     "end_time": "2021-07-23T16:17:57.064860",
     "exception": false,
     "start_time": "2021-07-23T16:17:57.038883",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## Preprocessing\n",
    "\n",
    "To attempt to flexibly account for fixed heterogeneity across counties, common time factors, and deterministic time trends, we include county-level averages, time period averages, initial conditions, and the time index as additional control variables. This strategy is related to strategies for addressing latent sources of heterogeneity via conditioning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FR0sUlnYu-SD",
    "papermill": {
     "duration": 0.024998,
     "end_time": "2021-07-23T16:17:57.115009",
     "exception": false,
     "start_time": "2021-07-23T16:17:57.090011",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We first reweight time and county variables as the data are population weighted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bxCZLcWljWhC"
   },
   "outputs": [],
   "source": [
    "# Note: These data are population weighted. Specifically,\n",
    "# looking at the JBES replication files, they seem to be multiplied\n",
    "# by sqrt((1/T sum_t population_{j,t})/100000). To get the\n",
    "# unweighted variables need to divide by this number - which we can\n",
    "# get from the time effects. We are mostly just going to use the weighted\n",
    "# variables as inputs - except for time and county. We'll take\n",
    "# cross-sectional and time series means of these weighted variables\n",
    "# as well. Note that there is nothing wrong with this, but it does not\n",
    "# reproduce a weighted regression in a setting where covariates may\n",
    "# enter nonlinearly and flexibly.\n",
    "\n",
    "\n",
    "# County FE\n",
    "county_vars = data.filter(like='X_J')\n",
    "\n",
    "# Time variables and population weights\n",
    "# Pull out time variables\n",
    "time_vars = data.filter(like='X_T')\n",
    "\n",
    "# Use these to construct population weights\n",
    "popweights = time_vars.sum(axis=1)\n",
    "\n",
    "# Unweighted time variables\n",
    "time_vars = time_vars.div(popweights, axis=0)\n",
    "\n",
    "# For any columns with only zeros, drop them\n",
    "time_vars = time_vars.loc[:, (time_vars != 0).any(axis=0)]\n",
    "\n",
    "# Create time index\n",
    "time_ind = (time_vars * np.arange(1, 21)).sum(axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cemPVzGnjred"
   },
   "source": [
    "Now we create initial conditions, county-level averages, and time period averages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "wK9LcsrwjW-w"
   },
   "outputs": [],
   "source": [
    "# Function to find variable names\n",
    "def varlist(df=None, type=[\"numeric\", \"factor\", \"character\"], pattern=\"\", exclude=None):\n",
    "    vars = []\n",
    "    if any(t in type for t in [\"numeric\", \"factor\", \"character\"]):\n",
    "        if \"numeric\" in type:\n",
    "            vars += df.select_dtypes(include=[\"number\"]).columns.tolist()\n",
    "        if \"factor\" in type:\n",
    "            vars += df.select_dtypes(include=[\"category\"]).columns.tolist()\n",
    "        if \"character\" in type:\n",
    "            vars += df.select_dtypes(include=[\"object\"]).columns.tolist()\n",
    "\n",
    "    if exclude:\n",
    "        vars = [var for var in vars if var not in exclude]\n",
    "\n",
    "    if pattern:\n",
    "        vars = [var for var in vars if re.search(pattern, var)]\n",
    "\n",
    "    return vars\n",
    "\n",
    "\n",
    "# Census control variables\n",
    "census = []\n",
    "census_var = [\"^AGE\", \"^BN\", \"^BP\", \"^BZ\", \"^ED\", \"^EL\", \"^HI\", \"^HS\",\n",
    "              \"^INC\", \"^LF\", \"^LN\", \"^PI\", \"^PO\", \"^PP\", \"^PV\", \"^SPR\", \"^VS\"]\n",
    "\n",
    "for pattern in census_var:\n",
    "    census += varlist(df=data, pattern=pattern)\n",
    "\n",
    "# Other control variables\n",
    "X1 = [\"logrobr\", \"logburg\", \"burg_missing\", \"robrate_missing\"]\n",
    "X2 = [\"newblack\", \"newfhh\", \"newmove\", \"newdens\", \"newmal\"]\n",
    "\n",
    "# \"Treatment\" variable\n",
    "d = \"logfssl\"\n",
    "\n",
    "# Outcome variable\n",
    "y = \"logghomr\"\n",
    "\n",
    "# New DataFrame for time index\n",
    "usedata = pd.DataFrame({'time.ind': time_ind})\n",
    "usedata['weights'] = popweights\n",
    "\n",
    "varlist_all = [y, d] + X1 + X2 + census\n",
    "for var in varlist_all:\n",
    "    usedata[var] = data[var]\n",
    "\n",
    "# Construct initial conditions\n",
    "varlist0 = [y] + X1 + X2 + census\n",
    "for var in varlist0:\n",
    "    usedata[f'{var}0'] = np.kron(usedata.loc[usedata['time.ind'] == 1, var], np.ones(20))\n",
    "\n",
    "# County means\n",
    "county_vars = pd.DataFrame(county_vars)\n",
    "for var in varlist_all:\n",
    "    usedata[f'{var}J'] = county_vars.dot(np.linalg.pinv(county_vars).dot(usedata[var]))\n",
    "\n",
    "# Time means\n",
    "time_vars = pd.DataFrame(time_vars)\n",
    "for var in varlist_all:\n",
    "    usedata[f'{var}T'] = time_vars.dot(np.linalg.pinv(time_vars).dot(usedata[var]))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BlVEEVRRlefo",
    "papermill": {
     "duration": 0.02615,
     "end_time": "2021-07-23T16:18:24.461261",
     "exception": false,
     "start_time": "2021-07-23T16:18:24.435111",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vN70lvlTlefo",
    "papermill": {
     "duration": 0.02615,
     "end_time": "2021-07-23T16:18:24.513673",
     "exception": false,
     "start_time": "2021-07-23T16:18:24.487523",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## Baseline OLS Estimates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ms333iuflefp",
    "papermill": {
     "duration": 0.027888,
     "end_time": "2021-07-23T16:18:24.568278",
     "exception": false,
     "start_time": "2021-07-23T16:18:24.540390",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "After preprocessing the data, as a baseline model, we first look at simple regression of $Y_{j,t}$ on $D_{j,t-1}$ without controls in the full data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "wJzGaFqClnUl"
   },
   "outputs": [],
   "source": [
    "# Baseline OLS\n",
    "X = sm.add_constant(usedata['logfssl'])\n",
    "y = usedata['logghomr']\n",
    "lm0 = sm.OLS(y, X).fit(cov_type='HC3')\n",
    "vc0 = lm0.cov_params()\n",
    "coef = lm0.params['logfssl']\n",
    "std_err = np.sqrt(vc0.loc['logfssl', 'logfssl'])\n",
    "print(\"Baseline OLS:\", coef, \"(\", std_err, \")\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zQ1CuTnclnUk",
    "papermill": {
     "duration": 0.02777,
     "end_time": "2021-07-23T16:18:24.772317",
     "exception": false,
     "start_time": "2021-07-23T16:18:24.744547",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "The point estimate is $0.302$ with the confidence interval ranging from 0.277 to 0.327. This\n",
    "suggests that increases in gun ownership rates are related to gun homicide rates - if gun ownership increases by 1% then the predicted gun homicide rate goes up by 0.3%, without controlling for counties' characteristics."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RMqfimTCbW-y"
   },
   "source": [
    "\n",
    "\n",
    "Next we estimate with the baseline set of controls."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ypiDeZoYmbus"
   },
   "outputs": [],
   "source": [
    "# Regression on baseline controls\n",
    "varlist = [d] + X1 + X2 + census\n",
    "X = sm.add_constant(usedata[varlist])\n",
    "y = usedata['logghomr']\n",
    "lmC = sm.OLS(y, X).fit(cov_type='HC3')\n",
    "vcC = lmC.cov_params()\n",
    "coef = lmC.params['logfssl']\n",
    "std_err = np.sqrt(vcC.loc['logfssl', 'logfssl'])\n",
    "print(\"OLS with Controls:\", coef, \"(\", std_err, \")\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "22N1iDA8lnUm",
    "papermill": {
     "duration": 0.043087,
     "end_time": "2021-07-23T16:18:25.122084",
     "exception": false,
     "start_time": "2021-07-23T16:18:25.078997",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<!-- Since our goal is to estimate the effect of gun ownership after controlling for a rich set county characteristics, we next include time and space averages. -->\n",
    "\n",
    "We can also run our regression with time and space averages as controls."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "EUh79DdelcXZ"
   },
   "outputs": [],
   "source": [
    "# Regression on time and cross-sectional averages\n",
    "varlistX = X1 + X2 + census\n",
    "varlistMeans = [d] + X1 + X2 + census\n",
    "\n",
    "# Adding county and time means\n",
    "for var in varlistX:\n",
    "    varlistMeans.append(var + 'J')\n",
    "    varlistMeans.append(var + 'T')\n",
    "\n",
    "X = sm.add_constant(usedata[varlistMeans])\n",
    "y = usedata['logghomr']\n",
    "lmM = sm.OLS(y, X).fit(cov_type='HC3')\n",
    "vcM = lmM.cov_params()\n",
    "coef = lmM.params['logfssl']\n",
    "std_err = np.sqrt(vcM.loc['logfssl', 'logfssl'])\n",
    "print(\"OLS with Averages:\", coef, \"(\", std_err, \")\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vfk4t3O-nQPA"
   },
   "source": [
    "Since our goal is to estimate the effect of gun ownership after controlling for a rich set county characteristics, we now include all controls."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "TQ0uPQvwlcZ3"
   },
   "outputs": [],
   "source": [
    "# Regression on all controls\n",
    "\n",
    "X = sm.add_constant(usedata.drop(columns=['logghomr']))\n",
    "y = usedata['logghomr']\n",
    "lmA = sm.OLS(y, X).fit(cov_type='HC3')\n",
    "vcA = lmA.cov_params()\n",
    "coef = lmA.params['logfssl']\n",
    "std_err = np.sqrt(vcA.loc['logfssl', 'logfssl'])\n",
    "print(\"OLS All:\", coef, \"(\", std_err, \")\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SkYb8j6anXg_"
   },
   "source": [
    "After controlling for a rich set of characteristics, the point estimate of gun ownership attenuates to 0.179.\n",
    "\n",
    "***NB***: In the background, `statsmodels` is dropping variables based on collinearity diagnostics. These depend on system linear algebra routines and can lead to large differences in high-dimensional or other ill-conditioned settings when using otherwise identical code across languages and/or machines.\n",
    "\n",
    "Now we turn to our double machine learning framework, employing linear and flexible estimation methods with cross-fitting."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "702RF417z6-1"
   },
   "source": [
    "## DML Estimates\n",
    "\n",
    "We perform inference on $\\beta$ in the following the partially linear model:\n",
    " $$\n",
    "Y_{j, t}=\\beta D_{j,(t-1)}+g(Z_{j,t})+\\epsilon_{j, t}.\n",
    "$$\n",
    "In the first stage, using cross-fitting, we employ modern regression methods to build estimators $\\hat \\ell(Z_{j,t})$ and $\\hat m(Z_{j,t})$, where\n",
    "- $\\ell(Z_{j,t}):=E(Y_{j,t}|Z_{j,t})$\n",
    "- $m(Z_{j,t}):=E(D_{j,t}|Z_{j,t})$\n",
    "\n",
    "Using these, we obtain the estimates of the residualized quantities\n",
    "- $\\tilde Y_{j,t} = Y_{j,t}- E(Y_{j,t}|Z_{j,t})$\n",
    "- $\\tilde D_{j,t}= D_{j,t}- E(D_{j,t}|Z_{j,t})$\n",
    "\n",
    "Using these residualized quantities, we note our model can be written as\n",
    "$$\n",
    "\\tilde Y_{j,t} = \\beta \\tilde D_{j,t} + \\epsilon_{j,t}, \\quad E (\\epsilon_{j,t} |\\tilde D_{j,t}) =0.\n",
    "$$\n",
    "In the final stage, using ordinary least squares of $\\tilde Y_{j,t}$ on $\\tilde D_{j,t}$, we obtain the\n",
    "estimate of $\\beta$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Y1rLIZVx1LNv"
   },
   "source": [
    "In the following, we consider 10 different methods for the first-stage models for $\\ell(\\cdot)$ and $m(\\cdot)$ covering linear, penalized linear, and flexible methods. We also report the first-stage RMSE scores for estimating $Y$ and $D$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "sakc09Jv0ggw"
   },
   "outputs": [],
   "source": [
    "def dml(X, D, y, modely, modeld, *, nfolds, classifier=False):\n",
    "    '''\n",
    "    DML for the Partially Linear Model setting with cross-fitting\n",
    "\n",
    "    Input\n",
    "    -----\n",
    "    X: the controls\n",
    "    D: the treatment\n",
    "    y: the outcome\n",
    "    modely: the ML model for predicting the outcome y\n",
    "    modeld: the ML model for predicting the treatment D\n",
    "    nfolds: the number of folds in cross-fitting\n",
    "    classifier: bool, whether the modeld is a classifier or a regressor\n",
    "\n",
    "    time: array of time indices, eg [0,1,...,T-1,0,1,...,T-1,...,0,1,...,T-1]\n",
    "    clu: array of cluster indices, eg [1073, 1073, 1073, ..., 5055, 5055, 5055, 5055]\n",
    "    cluster: bool, whether to use clustered standard errors\n",
    "\n",
    "    Output\n",
    "    ------\n",
    "    point: the point estimate of the treatment effect of D on y\n",
    "    stderr: the standard error of the treatment effect\n",
    "    yhat: the cross-fitted predictions for the outcome y\n",
    "    Dhat: the cross-fitted predictions for the treatment D\n",
    "    resy: the outcome residuals\n",
    "    resD: the treatment residuals\n",
    "    epsilon: the final residual-on-residual OLS regression residual\n",
    "    '''\n",
    "    cv = KFold(n_splits=nfolds, shuffle=True, random_state=123)  # shuffled k-folds\n",
    "    yhat = cross_val_predict(modely, X, y, cv=cv, n_jobs=-1)  # out-of-fold predictions for y\n",
    "    # out-of-fold predictions for D\n",
    "    # use predict or predict_proba dependent on classifier or regressor for D\n",
    "    if classifier:\n",
    "        Dhat = cross_val_predict(modeld, X, D, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]\n",
    "    else:\n",
    "        Dhat = cross_val_predict(modeld, X, D, cv=cv, n_jobs=-1)\n",
    "    # calculate outcome and treatment residuals\n",
    "    resy = y - yhat\n",
    "    resD = D - Dhat\n",
    "\n",
    "    dml_data = pd.concat([pd.Series(resy, name='resy'), pd.Series(resD, name='resD')], axis=1)\n",
    "    ols_mod = smf.ols(formula='resy ~ 1 + resD', data=dml_data).fit()\n",
    "\n",
    "    point = ols_mod.params[1]\n",
    "    stderr = ols_mod.bse[1]\n",
    "    epsilon = ols_mod.resid\n",
    "\n",
    "    return point, stderr, yhat, Dhat, resy, resD, epsilon"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "r4lz1FbliN3q"
   },
   "outputs": [],
   "source": [
    "def summary(point, stderr, yhat, Dhat, resy, resD, epsilon, X, D, y, *, name):\n",
    "    '''\n",
    "    Convenience summary function that takes the results of the DML function\n",
    "    and summarizes several estimation quantities and performance metrics.\n",
    "    '''\n",
    "    return pd.DataFrame({'estimate': point,  # point estimate\n",
    "                         'stderr': stderr,  # standard error\n",
    "                         'lower': point - 1.96 * stderr,  # lower end of 95% confidence interval\n",
    "                         'upper': point + 1.96 * stderr,  # upper end of 95% confidence interval\n",
    "                         'rmse y': np.sqrt(np.mean(resy**2)),  # RMSE of model that predicts outcome y\n",
    "                         'rmse D': np.sqrt(np.mean(resD**2))  # RMSE of model that predicts treatment D\n",
    "                         }, index=[name])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "MmOSY9HSJJt0"
   },
   "outputs": [],
   "source": [
    "# OLS No Controls\n",
    "Y = usedata['logghomr']\n",
    "D = usedata['logfssl']\n",
    "Z = pd.DataFrame({\"Const\": np.ones(len(Y))})  # regression on constant\n",
    "\n",
    "modely = make_pipeline(StandardScaler(), LinearRegression())\n",
    "modeld = make_pipeline(StandardScaler(), LinearRegression())\n",
    "\n",
    "# Run DML model with nfolds folds of cross-fitting\n",
    "result_OLS = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)\n",
    "table = summary(*result_OLS, Z, D, y, name='OLS No Controls')\n",
    "table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "CZPsTCxVVUrl"
   },
   "outputs": [],
   "source": [
    "# Basic Controls\n",
    "basic_controls = [d] + X1 + X2 + census\n",
    "Z = usedata[basic_controls].drop(columns=['logfssl'])\n",
    "\n",
    "modely = make_pipeline(StandardScaler(), LinearRegression())\n",
    "modeld = make_pipeline(StandardScaler(), LinearRegression())\n",
    "\n",
    "# Run DML model with nfolds folds of cross-fitting\n",
    "result_basic = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)\n",
    "table = pd.concat([table, summary(*result_basic, Z, D, y, name='OLS Basic Controls')])\n",
    "table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "hNfPSzkzVVDB"
   },
   "outputs": [],
   "source": [
    "# All Controls\n",
    "# Regression on time and cross-sectional averages\n",
    "varlistX = X1 + X2 + census\n",
    "varlistMeans = [d] + X1 + X2 + census\n",
    "\n",
    "# Adding county and time means\n",
    "for var in varlistX:\n",
    "    varlistMeans.append(var + 'J')\n",
    "    varlistMeans.append(var + 'T')\n",
    "\n",
    "Z = sm.add_constant(usedata[varlistMeans].drop(columns=['logfssl']))\n",
    "\n",
    "modely = make_pipeline(StandardScaler(), LinearRegression())\n",
    "modeld = make_pipeline(StandardScaler(), LinearRegression())\n",
    "\n",
    "# Run DML model with nfolds folds of cross-fitting\n",
    "result_all = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)\n",
    "table = pd.concat([table, summary(*result_all, Z, D, y, name='OLS All Controls')])\n",
    "table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "NmnXjQ5uZ5a1"
   },
   "outputs": [],
   "source": [
    "# Now lets do Cross-validated Lasso, Ridge, ENet\n",
    "cv = KFold(n_splits=10, shuffle=True, random_state=123)  # shuffled k-folds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "yi-3-yMCZtEK"
   },
   "outputs": [],
   "source": [
    "# Define LassoCV models with n_splits folds of cross-validation\n",
    "modely = make_pipeline(StandardScaler(), LassoCV(cv=cv))\n",
    "modeld = make_pipeline(StandardScaler(), LassoCV(cv=cv))\n",
    "\n",
    "# Run DML model with nfolds folds of cross-fitting\n",
    "result_LassoCV = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)\n",
    "table = pd.concat([table, summary(*result_LassoCV, Z, D, y, name='LassoCV')])\n",
    "table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "AzopGWpJYdcf"
   },
   "outputs": [],
   "source": [
    "# Define RidgeCV models with n_splits folds of cross-validation\n",
    "modely = make_pipeline(StandardScaler(), RidgeCV(cv=cv))\n",
    "modeld = make_pipeline(StandardScaler(), RidgeCV(cv=cv))\n",
    "\n",
    "# Run DML model with nfolds folds of cross-fitting\n",
    "result_RidgeCV = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)\n",
    "table = pd.concat([table, summary(*result_RidgeCV, Z, D, y, name='RidgeCV')])\n",
    "table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "yOC2nho2oXjY"
   },
   "outputs": [],
   "source": [
    "# Define ElasticNetCV models with n_splits folds of cross-validation\n",
    "modely = make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio=0.5, cv=cv))\n",
    "modeld = make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio=0.5, cv=cv))\n",
    "\n",
    "# Run DML model with nfolds folds of cross-fitting\n",
    "result_ENetCV = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)\n",
    "table = pd.concat([table, summary(*result_ENetCV, Z, D, y, name='ENetCV')])\n",
    "table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bBRrdvgwfbHp"
   },
   "outputs": [],
   "source": [
    "# DML with Random Forests. RFs don't require scaling but we do it for consistency\n",
    "modely = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))\n",
    "modeld = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))\n",
    "\n",
    "# Run DML model with nfolds folds of cross-fitting (computationally intensive)\n",
    "result_RF = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)\n",
    "table = pd.concat([table, summary(*result_RF, Z, D, y, name='RF')])\n",
    "table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jnusdWH0bF6u"
   },
   "outputs": [],
   "source": [
    "# DML with Boosted Trees\n",
    "modely = make_pipeline(StandardScaler(), GradientBoostingRegressor(max_depth=4, n_iter_no_change=5))\n",
    "modeld = make_pipeline(StandardScaler(), GradientBoostingRegressor(max_depth=4, n_iter_no_change=5))\n",
    "\n",
    "# Run DML model with nfolds folds of cross-fitting (computationally intensive)\n",
    "result_BT = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)\n",
    "table = pd.concat([table, summary(*result_BT, Z, D, y, name='Boosted Trees')])\n",
    "table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ke7GwmzBqfoO"
   },
   "outputs": [],
   "source": [
    "# DML with NNs\n",
    "modely = make_pipeline(StandardScaler(),\n",
    "                       MLPRegressor(hidden_layer_sizes=(50, 50, 50, 50),\n",
    "                                    activation='relu',\n",
    "                                    solver='adam',\n",
    "                                    alpha=0.0001,\n",
    "                                    batch_size=200,\n",
    "                                    learning_rate='constant',\n",
    "                                    learning_rate_init=0.001,\n",
    "                                    max_iter=200,\n",
    "                                    shuffle=True,\n",
    "                                    random_state=None,\n",
    "                                    tol=1e-4,\n",
    "                                    verbose=False,\n",
    "                                    warm_start=False,\n",
    "                                    momentum=0.9,\n",
    "                                    nesterovs_momentum=True,\n",
    "                                    early_stopping=True,\n",
    "                                    validation_fraction=0.2,\n",
    "                                    beta_1=0.9,\n",
    "                                    beta_2=0.999,\n",
    "                                    epsilon=1e-08,\n",
    "                                    n_iter_no_change=10)\n",
    "                       )\n",
    "modeld = make_pipeline(StandardScaler(),\n",
    "                       MLPRegressor(hidden_layer_sizes=(50, 50, 50, 50),\n",
    "                                    activation='relu',\n",
    "                                    solver='adam',\n",
    "                                    alpha=0.0001,\n",
    "                                    batch_size=200,\n",
    "                                    learning_rate='constant',\n",
    "                                    learning_rate_init=0.001,\n",
    "                                    max_iter=200,\n",
    "                                    shuffle=True,\n",
    "                                    random_state=None,\n",
    "                                    tol=1e-4,\n",
    "                                    verbose=False,\n",
    "                                    warm_start=False,\n",
    "                                    momentum=0.9,\n",
    "                                    nesterovs_momentum=True,\n",
    "                                    early_stopping=True,\n",
    "                                    validation_fraction=0.2,\n",
    "                                    beta_1=0.9,\n",
    "                                    beta_2=0.999,\n",
    "                                    epsilon=1e-08,\n",
    "                                    n_iter_no_change=10)\n",
    "                       )\n",
    "\n",
    "# Run DML model with nfolds folds of cross-fitting\n",
    "result_NN = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)\n",
    "table = pd.concat([table, summary(*result_NN, Z, D, y, name='NN (Early Stopping)')])\n",
    "table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "syua99Emazgl"
   },
   "outputs": [],
   "source": [
    "rmses = table.iloc[:, -2:]\n",
    "rmses"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "O2g0u1iR0qtT"
   },
   "outputs": [],
   "source": [
    "print(\"Lowest RMSE y: \", rmses.iloc[:, 0].idxmin())\n",
    "print(\"Lowest RMSE D: \", rmses.iloc[:, 1].idxmin())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "RISoyJQ51G2g"
   },
   "outputs": [],
   "source": [
    "# DML with best model, which is RF\n",
    "modely = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))\n",
    "modeld = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))\n",
    "\n",
    "# Run DML model with nfolds folds of cross-fitting (computationally intensive)\n",
    "result_best = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)\n",
    "table = pd.concat([table, summary(*result_best, Z, D, y, name='Best')])\n",
    "table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Y_0ANZmw2GX2"
   },
   "outputs": [],
   "source": [
    "# Least Squares Model Average\n",
    "yhat_all = pd.concat([\n",
    "    pd.Series(result_OLS[2]),\n",
    "    pd.Series(result_basic[2]),\n",
    "    pd.Series(result_all[2]),\n",
    "    pd.Series(result_LassoCV[2]),\n",
    "    pd.Series(result_RidgeCV[2]),\n",
    "    pd.Series(result_ENetCV[2]),\n",
    "    pd.Series(result_RF[2]),\n",
    "    pd.Series(result_BT[2]),\n",
    "    pd.Series(result_NN[2])\n",
    "], axis=1)\n",
    "\n",
    "Dhat_all = pd.concat([\n",
    "    pd.Series(result_OLS[3]),\n",
    "    pd.Series(result_basic[3]),\n",
    "    pd.Series(result_all[3]),\n",
    "    pd.Series(result_LassoCV[3]),\n",
    "    pd.Series(result_RidgeCV[3]),\n",
    "    pd.Series(result_ENetCV[3]),\n",
    "    pd.Series(result_RF[3]),\n",
    "    pd.Series(result_BT[3]),\n",
    "    pd.Series(result_NN[3])\n",
    "], axis=1)\n",
    "\n",
    "ma_y = sm.OLS(usedata['logghomr'], yhat_all).fit()\n",
    "ma_d = sm.OLS(usedata['logfssl'], Dhat_all).fit()\n",
    "\n",
    "weights_y = ma_y.params\n",
    "weights_d = ma_d.params\n",
    "\n",
    "lm_k = sm.OLS(ma_y.resid, ma_d.resid).fit(cov_type='HC3')\n",
    "lsma = pd.Series({\"estimate\": lm_k.params[0], \"stderr\": lm_k.bse[0]},\n",
    "                 name=\"Least Squares Model Average\").to_frame().T\n",
    "\n",
    "final_table = table.iloc[:, :2]\n",
    "final_table = pd.concat([final_table, lsma], axis=0)\n",
    "final_table"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 663.673787,
   "end_time": "2021-07-23T16:28:56.086168",
   "environment_variables": {},
   "exception": null,
   "input_path": "__notebook__.ipynb",
   "output_path": "__notebook__.ipynb",
   "parameters": {},
   "start_time": "2021-07-23T16:17:52.412381",
   "version": "2.2.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
